
fix: make tool call assertions flexible to prevent CI flakes#5295

Open
iamemilio wants to merge 2 commits into llamastack:main from iamemilio:fix/flexible-tool-call-assertions

Conversation

@iamemilio
Contributor

@iamemilio iamemilio commented Mar 25, 2026

Summary

  • Loosens strict len(response.output) == 1 assertions in tool call integration tests to accept one or more function calls (>= 1), since models may legitimately return multiple parallel tool calls for a single-tool prompt
  • Updates follow-up turns to respond to all returned tool calls (not just the first), preventing the tool_call_id not responded to API error when the model produces duplicate invocations
  • These changes make tests resilient to valid model behavior variations across different providers and model versions without weakening what is actually being tested
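The assertion change described above can be sketched as two small helpers. This is a minimal illustration, not the PR's actual diff: the `Item` dataclass stands in for OpenAI-style Responses API output items, and `function_calls` / `tool_outputs` are hypothetical helper names.

```python
# Sketch of the loosened assertion pattern (assumption: output items
# expose .type and .call_id like the OpenAI Responses API does).
from dataclasses import dataclass

@dataclass
class Item:
    type: str
    call_id: str = ""
    name: str = ""

def function_calls(output):
    """Collect every function_call item instead of assuming exactly one."""
    calls = [item for item in output if item.type == "function_call"]
    assert len(calls) >= 1  # one call still passes; parallel calls now do too
    return calls

def tool_outputs(calls, result: str):
    """Answer every call_id, so no tool call is left unresponded."""
    return [
        {"type": "function_call_output", "call_id": c.call_id, "output": result}
        for c in calls
    ]
```

For example, if a model eagerly emits two parallel `get_weather` calls, `tool_outputs` produces one `function_call_output` per `call_id`, which avoids the "tool_call_id not responded to" error on the follow-up turn.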

Motivation

When evaluating cheaper model alternatives for the Azure CI suite, we observed that some models (e.g. gpt-4.1-nano) occasionally return two parallel get_weather function calls instead of one. This is a valid API response — the model is correctly identifying the tool to call, just being eager about parallelism. The previous == 1 assertions treated this as a failure, creating intermittent CI flakes.

The fix is backwards-compatible: tests still pass when a model returns exactly one call (since 1 >= 1), while also accepting the multi-call case. The core behavior being verified (correct tool invocation, successful follow-up round-trip, final message with text) remains unchanged.

Test plan

  • Ran the full responses suite with gpt-4.1-nano via OpenAI — 196 passed, 0 failed, 26 skipped
  • Verified the previously flaky tests (test_function_call_output_list_text, test_function_call_output_list_text_multi_block, test_response_non_streaming_custom_tool) pass consistently with the loosened assertions
  • Assertions remain logically correct — still verify function_call type, status, name, and successful text response after tool output

Some models return multiple parallel tool calls for a single-tool prompt,
which is a valid API response. The previous assertions required exactly one
function call, causing intermittent CI failures when models produced
logically correct but duplicated tool invocations. This relaxes the
assertions to accept one or more function calls and responds to all of
them in follow-up turns, preventing the "tool_call_id not responded to"
error that occurs when only the first call is acknowledged.

Signed-off-by: Emilio Garcia <i.am.emilio@gmail.com>
Made-with: Cursor
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 25, 2026
Collaborator

@mattf mattf left a comment


can you pass parallel_tool_calls=False to create instead?

it'd be good to have a clear parallel_tool_calls=True test too. historically many models would fail it, but we should behave correctly. especially important is parallel_tool_calls=True and stream=True.
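The reviewer's alternative could look roughly like this. It is a sketch under assumptions: the request shape follows the OpenAI Responses API (which accepts a `parallel_tool_calls` flag on create), `client` is a placeholder, and the `get_weather` tool schema is simplified from what the tests actually use.

```python
# Hedged sketch of pinning the model to a single tool call per turn.
# `client` and the exact tool schema are placeholders, not the PR's code.
request_kwargs = dict(
    model="gpt-4.1-nano",
    input="What is the weather in San Francisco?",
    tools=[{
        "type": "function",
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
        },
    }],
    parallel_tool_calls=False,  # forbid duplicate parallel invocations
)
# response = client.responses.create(**request_kwargs)
# With parallel calls disabled, a strict `== 1` assertion is safe again,
# while a separate parallel_tool_calls=True test can keep the `>= 1` check.
```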


3 participants